Introduction

According to the Centers for Disease Control and Prevention (CDC), cardiovascular (heart) diseases account for 1 in every 4 deaths in the United States annually. The estimated cost to the nation’s economy is about $363 billion per year, including hospitalizations, follow-up health care services, and lost productivity due to death and disability (Virani et al., 2021). The CDC also states that, given the lifestyle habits of an average American, about half of the population (47%) is reported to have “1 of the 3 key risk factors for heart disease: high blood pressure, high cholesterol, and smoking”. However, the risk factors for developing heart disease are vast and diverse and tend to vary across racial and ethnic groups in the United States.

Therefore, there is an immediate need to detect the major risk factors early on, so that a wellness-oriented healthcare system can be instituted. Ideally, such a system would assess the intensity of the risk factors (relative to a baseline) and focus on preventing heart disease in the first place rather than on minimizing recurrences (Liau et al., 2010; Giampaoli et al., 2005).

The current understanding of the interconnected role of risk factors was first discussed in the Framingham Heart Study (FHS). Started in 1948, the study’s original goal was “to identify common factors or characteristics that contribute to cardiovascular disease” (Hong, 2018). Today, however, the FHS has morphed into a multigenerational study gathering genetic information that helps analyze family patterns of certain diseases, including cardiovascular ailments (National Heart, Lung, and Blood Institute). Building on this, current research in the field has focused on using precision medicine algorithms along with computational technology to assess a patient’s likelihood of developing heart disease given certain lifestyle factors. Most of these studies have sampled largely Caucasian individuals, indicating a gap in how well current risk assessment tools work for individuals of other races.

In this paper, we aim to understand the different risk factors that affect the occurrence of heart disease among the US population, and to examine how those risk factors interact with one another.

Description of Dataset

The original data set comes from the CDC’s Behavioral Risk Factor Surveillance System (BRFSS), an annual health survey that interviews over four hundred thousand individuals in the United States about risk factors for heart disease and other conditions. The 2020 survey (published on February 15, 2021) originally contained about 300 variables; Kamil Pytlak cleaned it down to 18 variables, and the resulting data set can be accessed on Kaggle.

# Load required packages and the data
library(funModeling)   # status(), plot_num(), freq()
library(dplyr)
library(ggplot2)
heart <- read.csv("heart_2020_cleaned.csv")

# Descriptive Statistics  
status(heart)
##                          variable q_zeros p_zeros q_na p_na q_inf p_inf
## HeartDisease         HeartDisease       0   0.000    0    0     0     0
## BMI                           BMI       0   0.000    0    0     0     0
## Smoking                   Smoking       0   0.000    0    0     0     0
## AlcoholDrinking   AlcoholDrinking       0   0.000    0    0     0     0
## Stroke                     Stroke       0   0.000    0    0     0     0
## PhysicalHealth     PhysicalHealth  226589   0.709    0    0     0     0
## MentalHealth         MentalHealth  205401   0.642    0    0     0     0
## DiffWalking           DiffWalking       0   0.000    0    0     0     0
## Sex                           Sex       0   0.000    0    0     0     0
## AgeCategory           AgeCategory       0   0.000    0    0     0     0
## Race                         Race       0   0.000    0    0     0     0
## Diabetic                 Diabetic       0   0.000    0    0     0     0
## PhysicalActivity PhysicalActivity       0   0.000    0    0     0     0
## GenHealth               GenHealth       0   0.000    0    0     0     0
## SleepTime               SleepTime       0   0.000    0    0     0     0
## Asthma                     Asthma       0   0.000    0    0     0     0
## KidneyDisease       KidneyDisease       0   0.000    0    0     0     0
## SkinCancer             SkinCancer       0   0.000    0    0     0     0
##                       type unique
## HeartDisease     character      2
## BMI                numeric   3604
## Smoking          character      2
## AlcoholDrinking  character      2
## Stroke           character      2
## PhysicalHealth     integer     31
## MentalHealth       integer     31
## DiffWalking      character      2
## Sex              character      2
## AgeCategory      character     13
## Race             character      6
## Diabetic         character      4
## PhysicalActivity character      2
## GenHealth        character      5
## SleepTime          integer     24
## Asthma           character      2
## KidneyDisease    character      2
## SkinCancer       character      2

From the table we can see that PhysicalHealth and MentalHealth have a high percentage of zeros. This means that about 71% of the people in our data set report zero days of poor physical health in a 30-day period, while about 64% of respondents report zero days of poor mental health in that same period.

We can also see that there are no N/A values in our data set. This is not surprising given that our data was cleaned before being retrieved.
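As a quick sanity check, the p_zeros figures in the table can be reproduced directly from the loaded data frame:

```r
# Fraction of respondents reporting zero days of poor physical/mental health
mean(heart$PhysicalHealth == 0)   # ~0.709
mean(heart$MentalHealth == 0)     # ~0.642
```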

Below are the plots of our numerical variables, along with their means grouped by heart disease status. We can see that, on average, people with heart disease have a slightly higher BMI than those who don’t. However, the biggest difference is in the number of days a respondent feels physically unwell: respondents with heart disease reported feeling physically unwell about 8 days per month, while those without reported only about 3. The number of poor mental health days in a 30-day period is also slightly higher for people with heart disease than for people without.

# Finding the mean of numerical variables grouped by heart disease
plot_num(heart)

heart %>%
  group_by(HeartDisease) %>%
  summarise_at(vars("BMI", 
                    "PhysicalHealth", 
                    "MentalHealth",
                    "SleepTime"), mean)
## # A tibble: 2 × 5
##   HeartDisease   BMI PhysicalHealth MentalHealth SleepTime
##   <chr>        <dbl>          <dbl>        <dbl>     <dbl>
## 1 No            28.2           2.96         3.83      7.09
## 2 Yes           29.4           7.81         4.64      7.14
# Recoding Diabetic: borderline counted as Yes, pregnancy-only as No

heart$Diabetic[heart$Diabetic == "No, borderline diabetes"] <- "Yes"
heart$Diabetic[heart$Diabetic == "Yes (during pregnancy)"] <- "No"

# Changing all the categorical variables to numerical dummies 

heart$HeartDisease<-ifelse(heart$HeartDisease=="Yes",1,0)
heart$Smoking<-ifelse(heart$Smoking=="Yes",1,0)
heart$AlcoholDrinking<-ifelse(heart$AlcoholDrinking=="Yes",1,0)
heart$Stroke<-ifelse(heart$Stroke=="Yes",1,0)
heart$DiffWalking<-ifelse(heart$DiffWalking=="Yes",1,0)
heart$PhysicalActivity<-ifelse(heart$PhysicalActivity=="Yes",1,0)
heart$KidneyDisease<-ifelse(heart$KidneyDisease=="Yes",1,0)
heart$SkinCancer<-ifelse(heart$SkinCancer=="Yes",1,0)
heart$Diabetic<-ifelse(heart$Diabetic=="Yes",1,0)
heart$Asthma<-ifelse(heart$Asthma=="Yes",1,0)

Gender and Lifestyle Factors

According to Noel Bairey Merz, MD, director of the Barbra Streisand Women’s Heart Center in the Smidt Heart Institute, women experience heart disease more severely than men: even though men are more likely to suffer heart attacks, women are more likely to die from one.

We started off by examining which sex tends to have more heart disease. From the table below, we can see that we have far more data for people without heart disease than with; this imbalance will be addressed later. Even so, the percentage table shows that more men have heart disease (about 5% of the sample) than women (about 3.5%). This is in line with our basic theory that men have more heart disease. It also forces us to ask why men have more heart disease in our sample: are they doing more or less of a certain activity than their female counterparts? Discussions of the relevant variables are provided below.

Proportion of Heart Disease Among the Genders

contable = table(heart$HeartDisease, heart$Sex)
xkabledply(contable, title="Contingency Table for Heart Disease and Sex")
Contingency Table for Heart Disease and Sex
        Female    Male
  0     156571  135851
  1      11234   16139
addmargins(contable)
##      
##       Female   Male    Sum
##   0   156571 135851 292422
##   1    11234  16139  27373
##   Sum 167805 151990 319795
prop <- prop.table(contable)
percent <- round(prop*100, 2)

# Contingency table for heart disease and sex, in percentages
print(percent)
##    
##     Female  Male
##   0  48.96 42.48
##   1   3.51  5.05
barplot(with(heart, table(HeartDisease, Sex)), main="Heart Disease Among Male & Female", beside=T, col=3:2)

In our exploratory analysis, we compared the different lifestyle variables between men and women and then further compared those same variables between those with and without heart disease.

BMI: Is there a difference between the mean BMI of men and women in the overall population?

hist(heart$BMI,main="Histogram of BMI in Data Set", col='pink', xlab="BMI Levels")

In the histogram, we can see that our data is approximately normal; moreover, given the Central Limit Theorem and our large sample size (n > 30), we can assume normality of the sample means for the sake of statistical tests.

After sub-setting the data between male and female, we conducted a Welch’s t-test on the BMIs of both population subsets.

# Subset male and female respondents
female <- subset(heart, Sex == "Female")
male   <- subset(heart, Sex == "Male")

# T-test of BMI between Genders 
t.test(female$BMI, male$BMI)
## 
##  Welch Two Sample t-test
## 
## data:  female$BMI and male$BMI
## t = -15, df = 3e+05, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.387 -0.299
## sample estimates:
## mean of x mean of y 
##      28.2      28.5

The p-value is less than the 5% significance level, allowing us to reject the null hypothesis and conclude that the overall average BMI differs between the male and female populations. From the sample means, men have a mean BMI of 28.50 while women have a mean BMI of 28.16, a difference of 0.34.

Next, we divided our data frame between men and women who have heart disease. Once again assuming normality due to the large sample size, we conducted another Welch’s t-test to compare the mean BMI within the population with heart disease.

BMI: Is there a difference in average BMI between the sexes with heart disease?

# Subsetting each sex to respondents who have heart disease 
female_HD <- subset(female, HeartDisease == 1)
male_HD   <- subset(male, HeartDisease == 1)

#T-test
t.test(female_HD$BMI, male_HD$BMI)
## 
##  Welch Two Sample t-test
## 
## data:  female_HD$BMI and male_HD$BMI
## t = -0.5, df = 20988, p-value = 0.6
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.208  0.121
## sample estimates:
## mean of x mean of y 
##      29.4      29.4

From the t-test output above, we see that the p-value is 0.60, which is greater than 0.05. We therefore fail to reject the null hypothesis and conclude that there is no significant difference in average BMI between men and women who have heart disease.

Physical Health: How many times in the last 30 days have you felt physically ill?

Earlier, we saw that respondents with heart disease reported feeling physically unwell about 8 days per month, while those without reported only about 3. The box plot below shows that women with heart disease generally report feeling unwell more often than men.

ggplot(heart, aes(x = factor(HeartDisease), y = PhysicalHealth))+
    geom_boxplot(aes(fill = Sex),outlier.shape = NA)+
    ggtitle('How many times in the last 30 days have you felt physically ill?') +
    xlab('Heart Disease') + 
    ylab('Physical Health')+
    scale_fill_brewer(palette = 'Set2', name = 'Gender')

t.test(male_HD$PhysicalHealth, female_HD$PhysicalHealth)
## 
##  Welch Two Sample t-test
## 
## data:  male_HD$PhysicalHealth and female_HD$PhysicalHealth
## t = -16, df = 23059, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.53 -1.97
## sample estimates:
## mean of x mean of y 
##      6.88      9.14

The t-test confirms what the box plot shows: since the p-value is less than 5%, there is a difference in the number of days that men and women with heart disease feel unwell. On average, women reported feeling sick about 9 days per month while men reported just under 7.

Mental Health: How many times in the last 30 days have you felt mentally ill?

ggplot(heart, aes(x = factor(HeartDisease), y = MentalHealth))+
    geom_boxplot(aes(fill = Sex),outlier.shape = NA)+
     ggtitle('How many times in the last 30 days have you felt mentally ill?') +
    xlab('Heart Disease') + 
    ylab('Mental Health')+
    scale_fill_brewer(palette = 'Set3', name = 'Sex')

t.test(male_HD$MentalHealth, female_HD$MentalHealth)
## 
##  Welch Two Sample t-test
## 
## data:  male_HD$MentalHealth and female_HD$MentalHealth
## t = -22, df = 21050, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.79 -2.34
## sample estimates:
## mean of x mean of y 
##      3.59      6.16

The box plot shows that women tend to report feeling mentally unwell more often than men. Since the p-value is less than 5%, there is a difference in the number of days that men and women with heart disease feel mentally unwell; on average, women reported about 6 such days per month versus under 4 for men. The cyclical relationship between mental health and exercise has been covered extensively (Mikkelsen et al., 2017; Raglin, 1990): individuals struggling with mental health issues are often unable to complete physically demanding tasks, which makes matters worse, because exercise produces hormones associated with improvements in mental health.

Sleep Time: On average, how many hours of sleep do you get in a 24-hour period?

ggplot(heart, aes(x = factor(HeartDisease), y = SleepTime))+
    geom_boxplot(aes(fill = Sex),outlier.shape = NA)+
    ggtitle('On average, how many hours of sleep do you get in a 24-hour period?') +
    xlab('Heart Disease') + 
    ylab('Sleep Time')+ 
    scale_fill_brewer(palette = 'Set1', name = 'Gender')

t.test(male_HD$SleepTime,female_HD$SleepTime)
## 
##  Welch Two Sample t-test
## 
## data:  male_HD$SleepTime and female_HD$SleepTime
## t = 4, df = 22378, p-value = 2e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.0515 0.1389
## sample estimates:
## mean of x mean of y 
##      7.18      7.08

Visually it looks like men and women get the same amount of sleep. However, the t-test shows that their sleep times are statistically different: men on average get about 0.10 more hours of sleep per day than women, or roughly 40 more minutes a week. This may not sound like much, but sleep effects are known to be cumulative. A study by Robbins et al. (2021) “examined a group of people who got the recommended seven to eight hours of sleep a night to those who got less (even just 30 minutes less), and exposed them to a norovirus”. The study found that curtailing sleep below the recommended seven hours increases the risk of viral infection, which was “nearly four times higher among those who are sleeping poorly or getting insufficient sleep, compared to those who are getting good rest” (Robbins et al., 2021).

This lack of sleep could explain why women in our data report feeling sick more often on average, and could also be a contributing factor to why women are more likely to die from heart attacks than men (Bairey Merz).

Smoking: Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes]

#Smoking 

ggplot(heart, aes(x = factor(HeartDisease), y = Smoking))+
    geom_boxplot(aes(fill = Sex),outlier.shape = NA)+
    ggtitle(' Have you smoked at least 100 cigarettes in your entire life?') +
    xlab('Heart Disease') + 
    ylab('Smoking')+ 
    scale_fill_brewer(palette = 'Set2', name = 'Gender')

t.test(male_HD$Smoking, female_HD$Smoking)
## 
##  Welch Two Sample t-test
## 
## data:  male_HD$Smoking and female_HD$Smoking
## t = 17, df = 23643, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.0941 0.1178
## sample estimates:
## mean of x mean of y 
##     0.629     0.523

Among respondents with heart disease, the proportions who have smoked are not equal: the t-test shows that men smoke at a higher rate than women (about 63% vs. 52%), and with a p-value below 2e-16 this difference is statistically significant.

Drinking: How many drinks have you had this week?

ggplot(heart, aes(x = factor(HeartDisease), y = AlcoholDrinking))+
    geom_boxplot(aes(fill = Sex),outlier.shape = NA)+
    ggtitle('How many drinks have you had this week?') +
    xlab('Heart Disease') + 
    ylab('Alcohol Consumption')+ 
    scale_fill_brewer(palette = 'Set3', name = 'Gender')

t.test(male_HD$AlcoholDrinking, female_HD$AlcoholDrinking)
## 
##  Welch Two Sample t-test
## 
## data:  male_HD$AlcoholDrinking and female_HD$AlcoholDrinking
## t = 3, df = 25196, p-value = 0.01
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.00133 0.01083
## sample estimates:
## mean of x mean of y 
##    0.0442    0.0381

Note that in the questionnaire, heavy drinkers are defined as “adult men having more than 14 drinks per week and adult women having more than 7 drinks per week”. Even though drinkers appear in roughly equal proportions among men and women, the threshold for being labelled a heavy drinker for a woman is half that for a man. Therefore the mean difference shown in the t-test may look small, but on a per-drink basis it is meaningful, and we can conclude that men drink more on a weekly basis than women.

Age, Race, General Health

HD <- subset(heart, HeartDisease==1)

freq(HD$AgeCategory)

##            var frequency percentage cumulative_perc
## 1  80 or older      5449      19.91            19.9
## 2        70-74      4847      17.71            37.6
## 3        65-69      4101      14.98            52.6
## 4        75-79      4049      14.79            67.4
## 5        60-64      3327      12.15            79.5
## 6        55-59      2202       8.04            87.6
## 7        50-54      1383       5.05            92.6
## 8        45-49       744       2.72            95.4
## 9        40-44       486       1.78            97.1
## 10       35-39       296       1.08            98.2
## 11       30-34       226       0.83            99.0
## 12       25-29       133       0.49            99.5
## 13       18-24       130       0.47           100.0
freq(HD$Race)

##                              var frequency percentage cumulative_perc
## 1                          White     22507      82.22            82.2
## 2                          Black      1729       6.32            88.5
## 3                       Hispanic      1443       5.27            93.8
## 4                          Other       886       3.24            97.0
## 5 American Indian/Alaskan Native       542       1.98            99.0
## 6                          Asian       266       0.97           100.0
freq(HD$GenHealth)

##         var frequency percentage cumulative_perc
## 1      Good      9558      34.92            34.9
## 2      Fair      7084      25.88            60.8
## 3 Very good      5381      19.66            80.5
## 4      Poor      3850      14.06            94.5
## 5 Excellent      1500       5.48           100.0

Looking at the make-up of respondents who do have heart disease, we see that most are 60 or older, which is in line with theory, and that 82% are of Caucasian descent. At the time of the interview, a plurality also rated their overall health as good. The racial imbalance may be a limiting factor for future research since, once again, the sample does not have many data points from non-Caucasian individuals.

Modelling

Although a logit regression is more appropriate for estimating whether or not an individual gets heart disease, we used linear regression (a linear probability model) in our exploratory stage to find the best-fitting set of predictors.

Linear Regression on Sex

reg_1 <- lm(HeartDisease~Sex+BMI, data=heart)
print('Model One BIC')
## [1] "Model One BIC"
BIC(reg_1)
## [1] 90503
summary (reg_1)
## 
## Call:
## lm(formula = HeartDisease ~ Sex + BMI, data = heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.2521 -0.1024 -0.0805 -0.0587  0.9677 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.03e-03   2.29e-03     2.2    0.028 *  
## SexMale     3.85e-02   9.87e-04    39.0   <2e-16 ***
## BMI         2.20e-03   7.76e-05    28.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.279 on 319792 degrees of freedom
## Multiple R-squared:  0.0074, Adjusted R-squared:  0.00739 
## F-statistic: 1.19e+03 on 2 and 319792 DF,  p-value: <2e-16
reg_2 <- lm(HeartDisease~Sex+BMI+ PhysicalHealth+MentalHealth + SleepTime, data=heart)
print('Model Two BIC')
## [1] "Model Two BIC"
BIC(reg_2)
## [1] 80963
summary (reg_2)
## 
## Call:
## lm(formula = HeartDisease ~ Sex + BMI + PhysicalHealth + MentalHealth + 
##     SleepTime, data = heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3585 -0.0918 -0.0726 -0.0425  1.0001 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -2.18e-02   3.41e-03   -6.40  1.5e-10 ***
## SexMale         4.22e-02   9.78e-04   43.10  < 2e-16 ***
## BMI             1.43e-03   7.70e-05   18.55  < 2e-16 ***
## PhysicalHealth  6.18e-03   6.41e-05   96.37  < 2e-16 ***
## MentalHealth   -4.95e-04   6.44e-05   -7.69  1.5e-14 ***
## SleepTime       3.95e-03   3.41e-04   11.58  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.275 on 319789 degrees of freedom
## Multiple R-squared:  0.0367, Adjusted R-squared:  0.0367 
## F-statistic: 2.44e+03 on 5 and 319789 DF,  p-value: <2e-16
reg_3 <- lm(HeartDisease~Sex+BMI+ PhysicalHealth+MentalHealth + Smoking + AlcoholDrinking+ SleepTime + AgeCategory + Race, data=heart)
print('Model Three BIC')
## [1] "Model Three BIC"
BIC(reg_3)
## [1] 60289
summary (reg_3)
## 
## Call:
## lm(formula = HeartDisease ~ Sex + BMI + PhysicalHealth + MentalHealth + 
##     Smoking + AlcoholDrinking + SleepTime + AgeCategory + Race, 
##     data = heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4570 -0.1220 -0.0529 -0.0020  1.0648 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -6.16e-02   5.25e-03  -11.74  < 2e-16 ***
## SexMale                 4.88e-02   9.57e-04   50.95  < 2e-16 ***
## BMI                     2.05e-03   7.62e-05   26.92  < 2e-16 ***
## PhysicalHealth          4.61e-03   6.32e-05   72.95  < 2e-16 ***
## MentalHealth            9.31e-04   6.39e-05   14.56  < 2e-16 ***
## Smoking                 3.50e-02   9.92e-04   35.27  < 2e-16 ***
## AlcoholDrinking        -2.25e-02   1.89e-03  -11.93  < 2e-16 ***
## SleepTime              -1.39e-03   3.33e-04   -4.16  3.1e-05 ***
## AgeCategory25-29       -6.29e-03   2.75e-03   -2.29   0.0222 *  
## AgeCategory30-34       -6.45e-03   2.69e-03   -2.40   0.0166 *  
## AgeCategory35-39       -5.83e-03   2.64e-03   -2.21   0.0269 *  
## AgeCategory40-44        1.14e-03   2.63e-03    0.43   0.6643    
## AgeCategory45-49        1.10e-02   2.61e-03    4.20  2.6e-05 ***
## AgeCategory50-54        2.92e-02   2.52e-03   11.57  < 2e-16 ***
## AgeCategory55-59        4.63e-02   2.44e-03   18.96  < 2e-16 ***
## AgeCategory60-64        6.98e-02   2.39e-03   29.17  < 2e-16 ***
## AgeCategory65-69        9.48e-02   2.39e-03   39.60  < 2e-16 ***
## AgeCategory70-74        1.32e-01   2.44e-03   54.00  < 2e-16 ***
## AgeCategory75-79        1.65e-01   2.65e-03   62.18  < 2e-16 ***
## AgeCategory80 or older  2.08e-01   2.58e-03   80.59  < 2e-16 ***
## RaceAsian              -2.45e-02   4.75e-03   -5.15  2.5e-07 ***
## RaceBlack              -2.01e-02   4.09e-03   -4.91  9.1e-07 ***
## RaceHispanic           -1.75e-02   4.03e-03   -4.35  1.4e-05 ***
## RaceOther              -1.30e-02   4.48e-03   -2.90   0.0037 ** 
## RaceWhite              -2.04e-02   3.73e-03   -5.48  4.3e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.266 on 319770 degrees of freedom
## Multiple R-squared:  0.0977, Adjusted R-squared:  0.0976 
## F-statistic: 1.44e+03 on 24 and 319770 DF,  p-value: <2e-16

With each addition of variables, the adjusted R-squared increases, indicating that the model fit is improving, although the fit statistic remains quite low. Note that the 40-44 age category has an insignificant p-value. Since Model 3 has the highest adjusted R-squared and the lowest BIC, it is our preferred model.
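The comparison can be collected into a single table (a convenience sketch using the reg_1, reg_2, and reg_3 objects fitted above):

```r
# Fit statistics for the three candidate models, side by side
fits <- list(reg_1 = reg_1, reg_2 = reg_2, reg_3 = reg_3)
data.frame(BIC    = sapply(fits, BIC),
           adj_r2 = sapply(fits, function(m) summary(m)$adj.r.squared))
```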

VIF Table to select variables

ezids::xkablevif(reg_3, wide=FALSE)
VIFs of the model
VIF
AgeCategory25-29 1.03
AgeCategory30-34 1.06
AgeCategory35-39 1.14
AgeCategory40-44 1.17
AgeCategory45-49 1.08
AgeCategory50-54 1.02
AgeCategory55-59 1.04
AgeCategory60-64 1.72
AgeCategory65-69 1.81
AgeCategory70-74 1.89
AgeCategory75-79 1.92
AgeCategory80 or older 1.95
AlcoholDrinking 2.10
BMI 2.28
MentalHealth 2.45
PhysicalHealth 2.47
RaceAsian 2.37
RaceBlack 1.99
RaceHispanic 2.10
RaceOther 2.51
RaceWhite 5.04
SexMale 5.77
SleepTime 3.00
Smoking 11.28

From the VIF table, the VIF value for Smoking is 11.28; values above 10 are generally taken to indicate problematic multicollinearity, so Smoking will be dropped. Our preferred model is therefore:

HeartDisease ~ Sex + BMI+ PhysicalHealth + MentalHealth + AlcoholDrinking + SleepTime + AgeCategory

We will train and test this model and examine the cost of its prediction errors.

Logistic Regression

To address the unbalanced classification, we subsetted our original data frame to create a new, balanced one (called total) with equal proportions of the HeartDisease categories. The balanced data set has 54,746 observations (27,373 per class), split 80-20 into train and test sets.

Model Training

# Creating a balanced data frame (called total) with equal proportions of
# HeartDisease values. Note: head() takes the first 27,373 rows of each
# subset; a random sample would avoid any ordering bias.
newdata1 <- subset(heart, HeartDisease==1)
newdata1_1 <- head(newdata1, 27373)
nrow(newdata1_1)
## [1] 27373
newdata0<- subset(heart, HeartDisease==0)
newdata0_1<-head(newdata0,27373)
nrow(newdata0_1)
## [1] 27373
total <- rbind(newdata1_1, newdata0_1)
#nrow(total)
#summary(total)
#freq(total$HeartDisease)


# Convert to factor variables 
total$HeartDisease<-factor(total$HeartDisease)
total$Sex<-factor(total$Sex)

# Split into 80-20 train-test sets

set.seed(1234) 

sample <- sample.int(n = nrow(total), size = floor(.80*nrow(total)), replace = F)
train <- total[sample, ]
test  <- total[-sample, ]

# Logit Model Training
model<- glm(HeartDisease ~ Sex + BMI+ PhysicalHealth + MentalHealth + AlcoholDrinking + SleepTime + AgeCategory, data= train, family='binomial')
#summary(model)

#Chi-squared test for model fit 
anova(model, test = 'Chisq')
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: HeartDisease
## 
## Terms added sequentially (first to last)
## 
## 
##                 Df Deviance Resid. Df Resid. Dev Pr(>Chi)    
## NULL                            43795      60714             
## Sex              1      704     43794      60010  < 2e-16 ***
## BMI              1      510     43793      59499  < 2e-16 ***
## PhysicalHealth   1     2416     43792      57083  < 2e-16 ***
## MentalHealth     1       28     43791      57055  9.4e-08 ***
## AlcoholDrinking  1      124     43790      56931  < 2e-16 ***
## SleepTime        1       10     43789      56921   0.0016 ** 
## AgeCategory     12     9273     43777      47647  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Looking at the anova chi-squared results, we can conclude that the logit model (HeartDisease ~ Sex + BMI + PhysicalHealth + MentalHealth + AlcoholDrinking + SleepTime + AgeCategory) is a good fit for predicting the probability of heart disease. All the coefficient p-values in the model are also significant, except for the 25-29 age category (its confidence interval in the odds ratio table below includes 1).

# Exponential Coefficients  & Confidence Interval 
exp_coeff<- exp(cbind(OR = coef(model), confint(model)))
knitr::kable(exp_coeff,caption='Odds Ratio & Confidence Interval of Model')
Odds Ratio & Confidence Interval of Model
OR 2.5 % 97.5 %
(Intercept) 0.018 0.014 0.023
SexMale 2.106 2.014 2.203
BMI 1.041 1.037 1.044
PhysicalHealth 1.042 1.039 1.044
MentalHealth 1.021 1.018 1.024
AlcoholDrinking 0.797 0.722 0.879
SleepTime 0.947 0.933 0.960
AgeCategory25-29 0.916 0.685 1.224
AgeCategory30-34 1.602 1.246 2.068
AgeCategory35-39 1.865 1.465 2.387
AgeCategory40-44 3.091 2.463 3.908
AgeCategory45-49 4.059 3.265 5.089
AgeCategory50-54 7.385 5.990 9.188
AgeCategory55-59 10.309 8.398 12.778
AgeCategory60-64 14.405 11.766 17.810
AgeCategory65-69 18.361 15.012 22.681
AgeCategory70-74 27.403 22.402 33.855
AgeCategory75-79 33.746 27.513 41.797
AgeCategory80 or older 51.239 41.794 63.438

According to the model, the odds of heart disease for a man are about 2.11 times those for a woman, while each one-unit increase in BMI multiplies the odds by about 1.04. Mental health, physical health, drinking, and sleep time also affect the odds of heart disease but are not the most salient risk factors; they could serve as early indicators. As expected, the odds of heart disease increase steadily with age.
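To make the odds ratios concrete, here is an illustrative calculation (the baseline probability of 0.20 is hypothetical; only the 2.106 odds ratio comes from the model):

```r
# How the SexMale odds ratio moves a probability: suppose a woman with a
# given risk profile has a 0.20 predicted probability of heart disease.
p_female    <- 0.20
odds_female <- p_female / (1 - p_female)    # 0.25
odds_male   <- odds_female * 2.106          # apply the SexMale odds ratio
p_male      <- odds_male / (1 + odds_male)  # back to a probability
round(p_male, 3)                            # ~0.345
```

An odds ratio multiplies odds, not probabilities, which is why the man's probability here is about 0.345 rather than 0.20 × 2.106.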

Also note that the confidence intervals for most variables are narrow, indicating that they are precise predictors of heart disease; however, the intervals for age widen as an individual’s age increases. Even with that greater uncertainty at higher ages, there may still be enough precision to advise a patient on preventative measures given their lifestyle choices.

Predicting and Assessing the Model

library(ggthemes)

# Prediction
train$prediction <- predict( model, newdata = train, type = "response" )
test$prediction  <- predict( model, newdata = test , type = "response" )

# distribution of the prediction score grouped by known outcome
ggplot( train, aes( prediction, color = as.factor(HeartDisease) ) ) + 
geom_density( size = 1 ) +
ggtitle( "Training Set's Predicted Score" ) + 
scale_color_economist( name = "data", labels = c( "No Heart Disease", "Yes Heart Disease" ) ) + 
theme_economist()

# Positive instance = 1 = Yes Heart Disease 
# Negative instance = 0 = No Heart Disease

After obtaining the predicted probabilities of heart disease on both the training and testing sets, we drew a double density plot of those predictions. Since we balanced our data, the scores for the negative instances are concentrated on the left while those for the positive instances are concentrated on the right; this is close to an ideal density plot.

ROC-AUC

# user-defined different cost for false negative and false positive

source('unbalanced_functions.R')

cost.fp = 100
cost.fn = 200

roc_info <- ROCInfo( data = test,
                     predict = "prediction",
                     actual  = "HeartDisease",
                     cost.fp = cost.fp,
                     cost.fn = cost.fn )
grid.draw(roc_info$plot)

From the plots, we can see that the Area Under the Curve is about 0.79, which is acceptable for our purposes. With our specified costs for false positives ($100) and false negatives ($200), the total loss comes to just under $382,000.
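The loss figure is the cost-weighted error count at the chosen cutoff. Assuming `ROCInfo` totals errors this way, the arithmetic is simply the following (the error counts here are illustrative placeholders, not the actual test-set counts):

```r
cost.fp <- 100   # cost assigned to a false positive
cost.fn <- 200   # cost assigned to a false negative

# Hypothetical error counts at the chosen cutoff (the real ones come from
# the test-set confusion matrix)
n_fp <- 1500
n_fn <- 1150

total_cost <- cost.fp * n_fp + cost.fn * n_fn
total_cost   # 380000 with these illustrative counts
```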

Confusion Matrix

loadPkg("regclass")
xkabledply( confusion_matrix(model), title = "Confusion Matrix from Logit Model" )
Confusion Matrix from Logit Model
Predicted 0 Predicted 1 Total
Actual 0 14653 7276 21929
Actual 1 4691 17176 21867
Total 19344 24452 43796
#unloadPkg("regclass")

# 
cm_info <- ConfusionMatrixInfo( data = test, 
                                predict = "prediction", 
                                actual  = 'HeartDisease', 
                                cutoff  = .30 )
#ggthemr("flat")
plot(cm_info$plot)

Calculated Values for Model:

Accuracy = 0.727
Mis-classification = 0.27
Sensitivity = 0.78
Specificity = 0.66

Using the optimal cut-off value of 0.30 obtained from the ROC-AUC loss function, we see that our model predicts more positive values than negative ones. The accuracy of the model is about 73%, which would not be adequate for clinical use but can still serve as a public early-detection tool to raise individual awareness. The model shows high sensitivity and low specificity: it is good at catching actual cases of heart disease but has a high false positive rate. In this setting, a false positive is far less harmful than a false negative. For instance, if individual A uses our model and is told they have a higher likelihood of future heart problems, but the result turns out to be a false positive, A simply ends up living a bit more healthily than before. It is the false negatives that we will need to reduce in the future.
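As a sanity check, these metrics can be recomputed directly from the counts in the confusion matrix above, which approximately reproduce the reported values:

```r
# Counts from the "Confusion Matrix from Logit Model" table
tn <- 14653; fp <- 7276   # Actual 0 row
fn <- 4691;  tp <- 17176  # Actual 1 row
n  <- tn + fp + fn + tp

accuracy    <- (tp + tn) / n    # ~0.727
sensitivity <- tp / (tp + fn)   # ~0.785, the true positive rate
specificity <- tn / (tn + fp)   # ~0.668, the true negative rate
round(c(accuracy, sensitivity, specificity), 3)
```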

HeartDisease vs Secondary Illnesses and Health Conditions (BMI and Difficulty Walking)

In this part we identify whether heart disease is related to secondary illnesses and to health conditions such as BMI and difficulty walking.

heart_dum <- read.csv("heart_2020_cleaned.csv")
heart_dum$Diabetic <- replace(heart_dum$Diabetic, heart_dum$Diabetic == "No, borderline diabetes", "Yes")
heart_dum$Diabetic[heart_dum$Diabetic == "Yes (during pregnancy)"] <- "No"

#Exploratory data analysis

library(ggplot2)
ggplot(heart, aes(HeartDisease)) +
  geom_bar(color="black",fill="aquamarine4")

ggplot(data = heart_dum) +
  geom_bar(mapping = aes(x = Stroke, fill = HeartDisease), position = "fill") +
  scale_fill_manual(values = c("Yes" = "black", "No" = "orange")) +
  ggtitle("HeartDisease vs Stroke")

ggplot(data = heart_dum) +
  geom_bar(mapping = aes(x = Diabetic, fill = HeartDisease), position = "fill") +
  scale_fill_manual(values = c("Yes" = "black", "No" = "orange")) +
  ggtitle("HeartDisease vs Diabetic")

ggplot(data = heart_dum) +
  geom_bar(mapping = aes(x = KidneyDisease, fill = HeartDisease), position = "fill") +
  scale_fill_manual(values = c("Yes" = "black", "No" = "orange")) +
  ggtitle("HeartDisease vs KidneyDisease")

ggplot(data = heart_dum) +
  geom_bar(mapping = aes(x = SkinCancer, fill = HeartDisease), position = "fill") +
  scale_fill_manual(values = c("Yes" = "black", "No" = "orange")) +
  ggtitle("HeartDisease vs SkinCancer")

ggplot(data = heart_dum) +
  geom_bar(mapping = aes(x = Asthma, fill = HeartDisease), position = "fill") +
  scale_fill_manual(values = c("Yes" = "black", "No" = "orange")) +
  ggtitle("HeartDisease vs Asthma")

ggplot(data = heart_dum) +
  geom_bar(mapping = aes(x = DiffWalking, fill = HeartDisease), position = "fill") +
  scale_fill_manual(values = c("Yes" = "black", "No" = "orange")) +
  ggtitle("HeartDisease vs Difficulty in Walking")

# We separate our variables of interest so they are easier to test.
causes <- data.frame (heart$HeartDisease,
                        heart$Stroke,
                        heart$Diabetic,
                        heart$Asthma,
                        heart$KidneyDisease,
                        heart$SkinCancer,
                        heart$DiffWalking)

# Rename columns

names(causes) <- c ('heart_disease',
                      'stroke',
                      'diabetic',
                      'asthma',
                      'kidney_disease',
                      'skin_cancer',
                       'Difficult_walking')

Correlation Matrix and Correlation Plot Between Diseases

cor_matrix<-cor(causes)
cor_matrix
##                   heart_disease stroke diabetic    asthma kidney_disease
## heart_disease            1.0000 0.1968   0.1782  0.041444         0.1452
## stroke                   0.1968 1.0000   0.1062  0.038866         0.0912
## diabetic                 0.1782 0.1062   1.0000  0.048559         0.1466
## asthma                   0.0414 0.0389   0.0486  1.000000         0.0397
## kidney_disease           0.1452 0.0912   0.1466  0.039707         1.0000
## skin_cancer              0.0933 0.0481   0.0377 -0.000396         0.0618
## Difficult_walking        0.2013 0.1741   0.2160  0.103222         0.1531
##                   skin_cancer Difficult_walking
## heart_disease        0.093317            0.2013
## stroke               0.048116            0.1741
## diabetic             0.037747            0.2160
## asthma              -0.000396            0.1032
## kidney_disease       0.061816            0.1531
## skin_cancer          1.000000            0.0648
## Difficult_walking    0.064840            1.0000
library(ggcorrplot)  # for ggcorrplot(); the %>% pipe comes from the tidyverse
model.matrix(~0+., data=causes) %>%
  cor(use="pairwise.complete.obs") %>%
  ggcorrplot(show.diag = FALSE, type="upper", lab=TRUE, lab_size=2)

From the correlation test above we can see that the correlations are not particularly strong overall. Heart disease correlates more with stroke than with any other secondary illness, which is not surprising. Diabetes is correlated with heart disease and kidney disease, and less so with stroke. Skin cancer shows negligible correlation with all other diseases. Overall, only stroke, diabetes, and kidney disease have a considerable correlation with heart disease. This agrees with the Centers for Disease Control and Prevention: common heart disorders can increase the risk for stroke. For example, coronary artery disease increases stroke risk because plaque builds up in the arteries and blocks the flow of oxygen-rich blood to the brain. Diabetes also increases the risk for stroke (see Stroke vs Diabetes in our correlation plot): diabetes causes sugars to build up in the blood and prevents oxygen and nutrients from reaching various parts of the body, including the brain. High blood pressure is also common in people with diabetes; it is the leading cause of stroke and the main reason for the increased stroke risk among people with diabetes (CDC Stroke Fact Sheet). (https://www.cdc.gov/stroke/conditions.htm#:~:text=Common%20heart%20disorders%20can%20increase,rich%20blood%20to%20the%20brain.)

Determining Independence Between Diseases

After finding the correlations between variables, we now perform a chi-square test of independence to determine whether the variables are dependent on or independent of each other. The chi-square test of independence is used to determine whether there is a significant relationship between two nominal (categorical) variables; all the variables here are already categorical. More generally, it tests whether the distributions of the categorical variables differ from one another: the frequency of each category of one nominal variable is compared across the categories of the second. The data can be displayed in a contingency table where each row represents a category of one variable and each column represents a category of the other. The null hypothesis for this test is that there is no relationship between the primary disease (HeartDisease) and the secondary illnesses; the alternative hypothesis is that there is a relationship between HeartDisease and the secondary illnesses.

The output reports the title of the test, the variables used, the test statistic, the degrees of freedom, and the p-value of the test.
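The statistic itself is the familiar sum of squared observed-minus-expected deviations. A hand computation on a toy 2x2 table (illustrative counts, not our data):

```r
# Toy 2x2 contingency table (illustrative counts)
obs <- matrix(c(30, 70,
                10, 90), nrow = 2, byrow = TRUE)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson statistic (without Yates' correction): sum((O - E)^2 / E)
x2 <- sum((obs - expected)^2 / expected)
x2   # 12.5; compare chisq.test(obs, correct = FALSE)$statistic
```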

hea_as=table(heart$HeartDisease,heart$Asthma)
chi_1=chisq.test(hea_as)
chi_1
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  hea_as
## X-squared = 549, df = 1, p-value <2e-16
hea_dia=table(heart$HeartDisease,heart$Diabetic)
chi_2=chisq.test(hea_dia)
chi_2
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  hea_dia
## X-squared = 10151, df = 1, p-value <2e-16
hea_kid=table(heart$HeartDisease,heart$KidneyDisease)
chi_3=chisq.test(hea_kid)
chi_3
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  hea_kid
## X-squared = 6739, df = 1, p-value <2e-16
hea_skin=table(heart$HeartDisease,heart$SkinCancer)
chi_4=chisq.test(hea_skin)
chi_4
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  hea_skin
## X-squared = 2784, df = 1, p-value <2e-16
hea_stroke=table(heart$HeartDisease,heart$Stroke)
chi_5=chisq.test(hea_stroke)
chi_5
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  hea_stroke
## X-squared = 12386, df = 1, p-value <2e-16
hea_walk=table(heart$HeartDisease,heart$DiffWalking)
chi_6=chisq.test(hea_walk)
chi_6
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  hea_walk
## X-squared = 12951, df = 1, p-value <2e-16

From the output (and from each test's p-value) we see that every p-value is below the 5% significance level. As with any other statistical test, when the p-value is less than the significance level we reject the null hypothesis: every secondary illness (stroke, diabetes, asthma, kidney disease, skin cancer) and difficulty walking has a statistically significant relationship with heart disease.

Logistic Regression Model

library(tidyverse)
library(caret)
heartsub <- data.frame(heart)
heartsub <- heartsub[, c("HeartDisease","Stroke","Diabetic","Asthma","KidneyDisease","SkinCancer","DiffWalking","BMI")]

newdata1<- subset(heartsub, HeartDisease==1)
newdata1_1<-head(newdata1,20000)

newdata0<- subset(heartsub, HeartDisease==0)
newdata0_1<-head(newdata0,20000)

lokesh <- rbind(newdata1_1, newdata0_1)

set.seed(100)
heart_train_rows <- sample(1:nrow(lokesh), 
                         0.7*nrow(lokesh))
heart_train <- lokesh[heart_train_rows, ]
heart_test <-  lokesh[-heart_train_rows, ]
mod1 <- glm(HeartDisease ~ Stroke + Diabetic + Asthma + KidneyDisease 
              + SkinCancer+DiffWalking+BMI,
              data=heart_train,family = "binomial")
summary(mod1)
## 
## Call:
## glm(formula = HeartDisease ~ Stroke + Diabetic + Asthma + KidneyDisease + 
##     SkinCancer + DiffWalking + BMI, family = "binomial", data = heart_train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.685  -0.893  -0.886   1.082   1.502  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -0.75273    0.06081  -12.38   <2e-16 ***
## Stroke         1.49484    0.05690   26.27   <2e-16 ***
## Diabetic       0.95070    0.03285   28.94   <2e-16 ***
## Asthma         0.04605    0.03693    1.25     0.21    
## KidneyDisease  0.87482    0.05707   15.33   <2e-16 ***
## SkinCancer     0.61962    0.03806   16.28   <2e-16 ***
## DiffWalking    0.92016    0.03267   28.16   <2e-16 ***
## BMI            0.00118    0.00211    0.56     0.58    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 38816  on 27999  degrees of freedom
## Residual deviance: 34155  on 27992  degrees of freedom
## AIC: 34171
## 
## Number of Fisher Scoring iterations: 4
expcoeff = exp(coef(mod1))
# expcoeff
xkabledply( as.table(expcoeff), title = "Exponential of coefficients in Logit Reg" )
Exponential of coefficients in Logit Reg
x
(Intercept) 0.471
Stroke 4.459
Diabetic 2.587
Asthma 1.047
KidneyDisease 2.398
SkinCancer 1.858
DiffWalking 2.510
BMI 1.001

Confusion matrix

#loadPkg("regclass")
# confusion_matrix
xkabledply( confusion_matrix(mod1), title = "Confusion matrix from Logit Model" )
Confusion matrix from Logit Model
Predicted 0 Predicted 1 Total
Actual 0 10343 3724 14067
Actual 1 5246 8687 13933
Total 15589 12411 28000
#unloadPkg("regclass")

Accuracy determines the overall predictive accuracy of the model. It is calculated as: Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives).

loadPkg("pROC")
library(pROC)
library(ROCR)
test_prob <- predict(mod1, newdata = heart_test, type = "response")
train_prob <-predict(mod1, newdata = heart_train, type = "response")
test_roc <- roc(heart_test$HeartDisease ~ test_prob, plot = TRUE, print.auc = TRUE)

fittedheartTrain = ifelse(train_prob < 0.5, 0, 1)
ConfMatTrain = table(heart_train$HeartDisease, fittedheartTrain)
TrainAccuracy = (ConfMatTrain[1,1] + ConfMatTrain[2,2]) / sum(ConfMatTrain)
TrainTPR = ConfMatTrain[2,2] / sum(ConfMatTrain[2,])   # true positive rate, from the matrix
TrainFPR = ConfMatTrain[1,2] / sum(ConfMatTrain[1,])   # false positive rate, from the matrix

print(cbind(TrainAccuracy, TrainTPR, TrainFPR))
##      TrainAccuracy TrainTPR TrainFPR
## [1,]          0.68    0.867    0.112
fittedheartTest = ifelse(test_prob < 0.5, 0, 1)
ConfMatTest = table(heart_test$HeartDisease, fittedheartTest)
TestAccuracy = (ConfMatTest[1,1] + ConfMatTest[2,2]) / sum(ConfMatTest)
TestTPR = ConfMatTest[2,2] / sum(ConfMatTest[2,])   # true positive rate
TestFPR = ConfMatTest[1,2] / sum(ConfMatTest[1,])   # equivalently, 1 - true negative rate

print(cbind(TestAccuracy, TestTPR, TestFPR))
##      TestAccuracy TestTPR TestFPR
## [1,]        0.687   0.818   0.119

Smokers vs Alcohol Drinkers

Our problem statement is: does smoking or drinking increase the risk of heart disease, and does having diabetes increase the risk of heart disease among smokers or drinkers?

We performed some basic EDA through visualizing the dataset by plotting graphs. After preliminary analysis the following observations are made:

The HeartDisease variable is heavily skewed: the percentage of people without heart disease is much higher than the percentage with it. Since HeartDisease is the target variable, this imbalance could affect the model's predictions.

To fix this, we built a subset of the dataset with equal proportions of yes and no values of the heart disease variable.

Graphs

  • From the graph between heart disease and smoking, we can say that people who smoke have a higher tendency to develop heart disease, which matches real-life expectations.

  • From the graph between heart disease and drinking, the proportion of drinkers among people with heart disease is lower than among people without it. This is contrary to the common assumption that people who drink are more prone to heart disease.

  • From the graph for diabetes, people with diabetes are more prone to heart disease.

  • From the graph between diabetes and smoking, we can conclude that the chances of having diabetes are higher among smokers.

  • From the graph between diabetes and alcohol drinking, fewer people with diabetes drink than abstain.

Conclusion from graphs

We have observed the following trends between heartdisease and smoking, alcoholdrinking and diabetics from the preliminary analysis:

The chances of having heart disease are higher for people with a smoking habit than for drinkers. People with diabetes are more prone to heart disease. Diabetes follows the same trend as heart disease: people who smoke are more prone to diabetes than people who drink.

Correlation coefficients

After the graphs, we carried out some statistical analysis of the variables. The correlation coefficients between heart disease and smoking, drinking, and diabetes are 0.184, -0.0609, and 0.266 respectively. Heart disease and diabetes are the most strongly correlated of the three, while drinking is the least correlated and negatively so. The correlation coefficients between diabetes and smoking, and between diabetes and alcohol drinking, are 0.0733 and -0.0705 respectively. Diabetes and smoking are lightly, positively correlated, and from these coefficient values we can say that smoking and drinking do not affect diabetes to any great extent.
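With 0/1 coding, these values are Pearson correlations of binary variables (the phi coefficient), which tend to look small even when an effect is real. A quick simulated sketch (toy data, not our dataset):

```r
set.seed(1)
smoker  <- rbinom(10000, 1, 0.4)
# Smoking raises the probability of disease from ~0.27 to ~0.45 here
disease <- rbinom(10000, 1, plogis(-1 + 0.8 * smoker))

cor(smoker, disease)   # a modest positive phi coefficient despite a real effect
```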

Contingency table

From the contingency table, approximately 29% of the sample are smokers with heart disease, while only about 21% are non-smokers with heart disease. About 2% of the sample are drinkers with heart disease, whereas about 3.5% are drinkers without heart disease; this is the reason for the negative correlation between heart disease and drinking. Likewise, about 18% of the sample have both diabetes and heart disease, while only about 7% have diabetes without any heart disease.

Model building

We built models to examine the statistical relationship between heart disease and smoking, alcohol drinking, and diabetes, and to predict the chance of heart disease from those variables. We then checked whether the predicted chance of heart disease increases or decreases when diabetes is taken into account among smokers and drinkers.

Since this is a classification problem, we used logistic regression to predict the chance of heart disease. We built two models, model_lr and model_lr2.

model_lr has smoking and alcohol drinking as independent variables and heart disease as the dependent variable, with an interaction between smoking and alcohol drinking. In model_lr2, smoking, alcohol drinking, and diabetes are independent variables, with interactions between diabetes and alcohol drinking and between diabetes and smoking.

Logistic regression model_lr

From the p-values, both smoking and alcohol drinking are statistically significant. Smoking increases the log-odds of heart disease by 0.7891, while alcohol drinking decreases the log-odds by 0.7844. The high z-value of 39.31 for smoking tells us that smoking and heart disease are strongly associated. The high p-value of the smoking:alcohol-drinking interaction indicates no statistically significant interaction between the two variables.
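These coefficients act on the log-odds scale, not the percentage scale; exponentiating them yields the odds ratios reported for this model. A quick check using the reported estimates:

```r
# Coefficients from the model_lr summary
b_smoking  <- 0.7891
b_drinking <- -0.7844

round(exp(b_smoking), 3)    # 2.201: smokers' odds are ~2.2x non-smokers'
round(exp(b_drinking), 3)   # 0.456: drinkers' odds are ~0.46x non-drinkers'
```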

Accuracy for model_lr

We got an accuracy of 58.9% for model_lr, which is not good given that the subset has an equal distribution of heart disease patients, so a trivial classifier would already reach 50%.

Exponentiated coefficients

The odds ratio is the ratio of the odds of having heart disease for a person with a given habit to the odds for a person without it. Smoking multiplies the odds of heart disease by 2.2, while alcohol drinking multiplies them by 0.45.

Confusion matrix

From the confusion matrix we can see that the model predicted true positives only 12157 times, with 7867 false positives. Despite an accuracy close to 60%, the false positive count is high relative to the true positives, so the model is not good.

Roc curve

From the ROC curve we can conclude that the model's discrimination is poor, as the curve is close to the diagonal.

Model_lr2

For model_lr2, all three variables (smoking, drinking, diabetes) are statistically significant. The effect of diabetes on heart disease among people who smoke is statistically significant, while there is no strong relationship between diabetes and alcohol drinking in heart disease patients.

Accuracy

The accuracy of model_lr2 increased when diabetes was added, which shows the significance of diabetes for heart disease patients.

Exponentiated coefficients

From the exponentiated coefficients of model_lr2, the smoking:diabetes interaction multiplies the odds of heart disease by about 0.87, so the combined effect of smoking and diabetes is slightly less than multiplicative.
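With an interaction term, the odds ratio for someone who both smokes and has diabetes multiplies all three exponentiated terms, not just the interaction. A sketch using the coefficients reported in the model_lr2 summary:

```r
# Coefficients reported in the model_lr2 summary
b_smoking     <- 0.78652
b_diabetic    <- 1.36178
b_interaction <- -0.13908   # Smoking:Diabetic

# Odds ratio for a smoker with diabetes versus a non-smoker without diabetes:
or_combined <- exp(b_smoking + b_diabetic + b_interaction)
round(or_combined, 2)   # 7.46: far above either risk factor alone
```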

Confusion matrix

From the confusion matrix, there are 15421 true positives and 9483 false positives, which is slightly better than the first model, but we cannot say the accuracy is good given how many misclassifications remain (including 6446 false negatives).

roc curve

The ROC curve is slightly better than model_lr's.

To conclude, model_lr2 performs slightly better than model_lr, which signifies a strong correlation between diabetes and heart disease, which is in turn associated with smoking. The relationships the model finds between heart disease and smoking and diabetes match real-life expectations. Drinking is not significantly positively related to heart disease, which runs counter to real-life expectations; this likely arises from the fact that there are very few true positive cases among alcohol drinkers. Accuracies of 59% and 63% are modest, and the number of false positives is too high, which makes the models unreliable on their own.

barplot(table(total$Smoking, total$HeartDisease), legend= rownames(table(total$Smoking, total$HeartDisease)), col =(c("orange", "blue")), xlab="heartdisease", ylab="smoking",main = "Barplot for Heartdisease and Smoking")

barplot(table(total$AlcoholDrinking, total$HeartDisease), legend= rownames(table(total$AlcoholDrinking, total$HeartDisease)), col =(c("red", "darkblue")), xlab="Heartdisease", ylab="Alcoholdrinking", main = "Barplot for Heartdisease and Alcoholdrinking")

barplot(table(total$Diabetic, total$HeartDisease), legend= rownames(table(total$Diabetic, total$HeartDisease)), col =(c("green", "brown")),xlab="HeartDisease",ylab="Diabetics", main = "Barplot for Heartdisease and Diabetic")

barplot(table(total$Diabetic, total$Smoking), legend= rownames(table(total$Diabetic, total$Smoking)), col =(c("red", "green")),xlab="Smoking",ylab="Diabetics", main = "Barplot for Smoking and Diabetic")

barplot(table(total$Diabetic, total$AlcoholDrinking), legend= rownames(table(total$Diabetic, total$AlcoholDrinking)), col =(c("red", "blue")),xlab="AlcoholDrinking",ylab="Diabetics", main = "Barplot for AlcoholDrinking and Diabetic")

total$HeartDisease <- as.numeric(total$HeartDisease)

cor_as_hd <- cor(total$HeartDisease, total$AlcoholDrinking)
cor_as_hd
## [1] -0.0609
cor_hd_smoke <- cor(total$HeartDisease, total$Smoking)
cor_hd_smoke
## [1] 0.184
cor_hd_dia <- cor(total$HeartDisease, total$Diabetic)
cor_hd_dia
## [1] 0.266
cor_dia_smoke <- cor(total$Diabetic, total$Smoking)
cor_dia_smoke
## [1] 0.0733
cor_dia_drink <- cor(total$Diabetic, total$AlcoholDrinking)
cor_dia_drink
## [1] -0.0705
contable  <- table(total$HeartDisease, total$Smoking)
prop <-prop.table(contable)
percent <- round(prop*100, 2)
xkabledply(percent, title="Contingency Table for heart Disease and Smoking in Percentage")
Contingency Table for heart Disease and Smoking in Percentage
0 1
29.9 20.1
20.7 29.3
contable2 <- table(total$HeartDisease, total$AlcoholDrinking)
prop <-prop.table(contable2)
percent <- round(prop*100, 2)
xkabledply(percent, title="Contingency Table for heart Disease and AlcoholDrinking in Percentage")
Contingency Table for heart Disease and AlcoholDrinking in Percentage
0 1
46.5 3.48
47.9 2.08
contable3 <- table(total$HeartDisease, total$Diabetic)
prop <-prop.table(contable3)
percent <- round(prop*100, 2)
xkabledply(percent, title="Contingency Table for heart Disease and Diabetics in Percentage")
Contingency Table for heart Disease and Diabetics in Percentage
0 1
43.6 6.39
32.2 17.80
model_lr <- glm(HeartDisease ~ Smoking + AlcoholDrinking + AlcoholDrinking:Smoking, data = train , family = "binomial")
summary(model_lr)
## 
## Call:
## glm(formula = HeartDisease ~ Smoking + AlcoholDrinking + AlcoholDrinking:Smoking, 
##     family = "binomial", data = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.367  -1.031  -0.746   0.999   1.683  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -0.3538     0.0139  -25.44   <2e-16 ***
## Smoking                   0.7891     0.0201   39.31   <2e-16 ***
## AlcoholDrinking          -0.7844     0.0828   -9.47   <2e-16 ***
## Smoking:AlcoholDrinking   0.1050     0.0978    1.07     0.28    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 60714  on 43795  degrees of freedom
## Residual deviance: 58880  on 43792  degrees of freedom
## AIC: 58888
## 
## Number of Fisher Scoring iterations: 4
xkabledply(model_lr, title = paste("Logistic Regression :", format(formula(model_lr))))
Logistic Regression : HeartDisease ~ Smoking + AlcoholDrinking + AlcoholDrinking:Smoking
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.354 0.0139 -25.44 0.000
Smoking 0.789 0.0201 39.31 0.000
AlcoholDrinking -0.784 0.0828 -9.47 0.000
Smoking:AlcoholDrinking 0.105 0.0978 1.07 0.283
fitted.results <- predict(model_lr,newdata = test, type='response')
fitted.results <- ifelse(fitted.results > 0.5,1,0)
misClasificError <- mean(fitted.results != test$HeartDisease)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.589680365296804"
expcoeff = exp(coef(model_lr))
xkabledply( as.table(expcoeff), title = "Exponential of coefficients in Logit Reg" )
Exponential of coefficients in Logit Reg
x
(Intercept) 0.702
Smoking 2.201
AlcoholDrinking 0.456
Smoking:AlcoholDrinking 1.111
knitr::kable(expcoeff,caption='Odds Ratio & Confidence Interval of Model')
Odds Ratio & Confidence Interval of Model
x
(Intercept) 0.702
Smoking 2.201
AlcoholDrinking 0.456
Smoking:AlcoholDrinking 1.111

Using Confidence Intervals

xkabledply(confint(model_lr), title = "CIs using profiled log-likelihood" )
xkabledply(confint.default(model_lr), title = "CIs using standard errors" )
CIs using profiled log-likelihood
2.5 % 97.5 %
(Intercept) -0.3811 -0.327
Smoking 0.7497 0.828
AlcoholDrinking -0.9490 -0.624
Smoking:AlcoholDrinking -0.0852 0.298
CIs using standard errors
2.5 % 97.5 %
(Intercept) -0.3811 -0.327
Smoking 0.7497 0.828
AlcoholDrinking -0.9468 -0.622
Smoking:AlcoholDrinking -0.0866 0.297

Model evaluation

new_heart <- subset(heart,select = c(HeartDisease, Smoking, AlcoholDrinking, Diabetic))
cor_matrix<-cor(new_heart)
cor_matrix
##                 HeartDisease Smoking AlcoholDrinking Diabetic
## HeartDisease          1.0000  0.1078         -0.0321   0.1782
## Smoking               0.1078  1.0000          0.1118   0.0577
## AlcoholDrinking      -0.0321  0.1118          1.0000  -0.0579
## Diabetic              0.1782  0.0577         -0.0579   1.0000
model.matrix(~0+., data=new_heart) %>% 
     cor(use="pairwise.complete.obs") %>% 
    ggcorrplot(show.diag = F, type="upper", lab=TRUE, lab_size=3)

loadPkg("regclass")
xkabledply(confusion_matrix(model_lr), title = "Confusion matrix from Logit Model" )
Confusion matrix from Logit Model
Predicted 0 Predicted 1 Total
Actual 0 14062 7867 21929
Actual 1 9710 12157 21867
Total 23772 20024 43796

Receiver-Operator-Characteristic (ROC) curve and Area-Under-Curve (AUC)

loadPkg("pROC") 


library(ROCR)
p <- predict(model_lr, newdata=test, type="response")
pr <- prediction(p, test$HeartDisease)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)

auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.593
#area of curve is 0.593
model_lr2 <- glm(HeartDisease ~ Smoking + AlcoholDrinking + Diabetic + Diabetic:Smoking + Diabetic:AlcoholDrinking, data = train , family = "binomial")
summary(model_lr2)
## 
## Call:
## glm(formula = HeartDisease ~ Smoking + AlcoholDrinking + Diabetic + 
##     Diabetic:Smoking + Diabetic:AlcoholDrinking, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.779  -0.914  -0.718   1.123   1.721  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -0.65649    0.01610  -40.77   <2e-16 ***
## Smoking                   0.78652    0.02281   34.49   <2e-16 ***
## AlcoholDrinking          -0.56685    0.04808  -11.79   <2e-16 ***
## Diabetic                  1.36178    0.03511   38.79   <2e-16 ***
## Smoking:Diabetic         -0.13908    0.05019   -2.77   0.0056 ** 
## AlcoholDrinking:Diabetic  0.00964    0.13462    0.07   0.9429    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 60714  on 43795  degrees of freedom
## Residual deviance: 55946  on 43790  degrees of freedom
## AIC: 55958
## 
## Number of Fisher Scoring iterations: 4
xkabledply(model_lr2, title = paste("Logistic Regression :", format(formula(model_lr2))))
Logistic Regression : HeartDisease ~ Smoking + AlcoholDrinking + Diabetic + Diabetic:Smoking +
Logistic Regression : Diabetic:AlcoholDrinking
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.6565 0.0161 -40.7697 0.0000
Smoking 0.7865 0.0228 34.4882 0.0000
AlcoholDrinking -0.5669 0.0481 -11.7909 0.0000
Diabetic 1.3618 0.0351 38.7872 0.0000
Smoking:Diabetic -0.1391 0.0502 -2.7708 0.0056
AlcoholDrinking:Diabetic 0.0096 0.1346 0.0716 0.9429
fitted.results <- predict(model_lr2,newdata = test, type='response')
fitted.results <- ifelse(fitted.results > 0.5,1,0)
misClasificError <- mean(fitted.results != test$HeartDisease)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.629041095890411"
expcoeff2 = exp(coef(model_lr2))
xkabledply( as.table(expcoeff2), title = "Exponential of coefficients in Logit Reg" )
Exponential of coefficients in Logit Reg
x
(Intercept) 0.519
Smoking 2.196
AlcoholDrinking 0.567
Diabetic 3.903
Smoking:Diabetic 0.870
AlcoholDrinking:Diabetic 1.010
knitr::kable(expcoeff2,caption='Odds Ratio & Confidence Interval of Model')
Odds Ratio & Confidence Interval of Model
x
(Intercept) 0.519
Smoking 2.196
AlcoholDrinking 0.567
Diabetic 3.903
Smoking:Diabetic 0.870
AlcoholDrinking:Diabetic 1.010
xkabledply(confint(model_lr2), title = "CIs using profiled log-likelihood" )
xkabledply(confint.default(model_lr2), title = "CIs using standard errors" )
CIs using profiled log-likelihood
2.5 % 97.5 %
(Intercept) -0.688 -0.6250
Smoking 0.742 0.8313
AlcoholDrinking -0.661 -0.4730
Diabetic 1.293 1.4308
Smoking:Diabetic -0.237 -0.0406
AlcoholDrinking:Diabetic -0.252 0.2761
CIs using standard errors
2.5 % 97.5 %
(Intercept) -0.688 -0.6249
Smoking 0.742 0.8312
AlcoholDrinking -0.661 -0.4726
Diabetic 1.293 1.4306
Smoking:Diabetic -0.237 -0.0407
AlcoholDrinking:Diabetic -0.254 0.2735

Model evaluation

Confusion matrix

#loadPkg("regclass")
xkabledply( confusion_matrix(model_lr2), title = "Confusion matrix from Logit Model" )
Confusion matrix from Logit Model
Predicted 0 Predicted 1 Total
Actual 0 12446 9483 21929
Actual 1 6446 15421 21867
Total 18892 24904 43796
loadPkg("pROC") 


library(ROCR)
p1 <- predict(model_lr2, newdata=test, type="response")
pr1 <- prediction(p1, test$HeartDisease)
prf1 <- performance(pr1, measure = "tpr", x.measure = "fpr")
plot(prf1)

# AUC for model_lr2 -- note we must use pr1 (model_lr2's predictions), not pr
auc1 <- performance(pr1, measure = "auc")
auc1 <- auc1@y.values[[1]]
auc1

HOW GENERAL HEALTH, PHYSICAL ACTIVITY, AND ETHNICITY AFFECT HEART DISEASE

Converting General Health to numeric values

#plot_num(heart)
heart$GenHealth <- as.factor(heart$GenHealth)
heart$GeneralHealth <- sapply(heart$GenHealth, unclass)  # integer codes for the factor levels

# 1 - Excellent, 2 - Fair, 3 - Good, 4 - Poor, 5 - Very good (alphabetical level order)

General Health is a categorical variable with the levels Excellent, Fair, Good, Poor, and Very good. We converted it into numeric codes for our analysis: Excellent = 1, Fair = 2, Good = 3, Poor = 4, and Very good = 5. Note that these codes follow the alphabetical order of the factor levels, not an ordering from worst to best health.
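Because `unclass` simply returns the factor's alphabetical level codes, "Poor" (4) sits above "Good" (3). A short sketch of the default coding, plus an explicit level ordering if a true worst-to-best scale is wanted:

```r
# Default factor levels are alphabetical: Excellent, Fair, Good, Poor, Very good
g <- factor(c("Poor", "Good", "Very good", "Excellent", "Fair"))
as.integer(g)   # 4 3 5 1 2

# For an actual worst-to-best scale, set the levels explicitly
g_ordered <- factor(g, levels = c("Poor", "Fair", "Good", "Very good", "Excellent"))
as.integer(g_ordered)   # 1 3 4 5 2
```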

Frequency distributions of all our categorical variables

freq(heart)

##      Sex frequency percentage cumulative_perc
## 1 Female    167805       52.5            52.5
## 2   Male    151990       47.5           100.0

##    AgeCategory frequency percentage cumulative_perc
## 1        65-69     34151      10.68            10.7
## 2        60-64     33686      10.53            21.2
## 3        70-74     31065       9.71            30.9
## 4        55-59     29757       9.31            40.2
## 5        50-54     25382       7.94            48.2
## 6  80 or older     24153       7.55            55.7
## 7        45-49     21791       6.81            62.5
## 8        75-79     21482       6.72            69.2
## 9        18-24     21064       6.59            75.8
## 10       40-44     21006       6.57            82.4
## 11       35-39     20550       6.43            88.8
## 12       30-34     18753       5.86            94.7
## 13       25-29     16955       5.30           100.0

##                             Race frequency percentage cumulative_perc
## 1                          White    245212      76.68            76.7
## 2                       Hispanic     27446       8.58            85.3
## 3                          Black     22939       7.17            92.4
## 4                          Other     10928       3.42            95.9
## 5                          Asian      8068       2.52            98.4
## 6 American Indian/Alaskan Native      5202       1.63           100.0

##   GenHealth frequency percentage cumulative_perc
## 1 Very good    113858      35.60            35.6
## 2      Good     93129      29.12            64.7
## 3 Excellent     66842      20.90            85.6
## 4      Fair     34677      10.84            96.5
## 5      Poor     11289       3.53           100.0
## [1] "Variables processed: Sex, AgeCategory, Race, GenHealth"

Changing the Diabetes "borderline" and "during pregnancy" categories to Yes/No

heart$Diabetic<- replace (heart$Diabetic, heart$Diabetic== "No, borderline diabetes" , "Yes")
heart$Diabetic[heart$Diabetic== "Yes (during pregnancy)"] <- "No"  

#Checking if diabetic has been changed

describe (heart$Diabetic)
## heart$Diabetic 
##        n  missing distinct 
##   319795        0        2 
##                         
## Value           0      1
## Frequency  272212  47583
## Proportion  0.851  0.149
freq(heart$Diabetic)

##   var frequency percentage cumulative_perc
## 1   0    272212       85.1            85.1
## 2   1     47583       14.9           100.0

Changing all the categorical variables to numerical dummies

heart$Smoking<-ifelse(heart$Smoking=="Yes",1,0)
heart$AlcoholDrinking<-ifelse(heart$AlcoholDrinking=="Yes",1,0)
heart$Stroke<-ifelse(heart$Stroke=="Yes",1,0)
heart$DiffWalking<-ifelse(heart$DiffWalking=="Yes",1,0)
heart$PhysicalActivity<-ifelse(heart$PhysicalActivity=="Yes",1,0)
heart$KidneyDisease<-ifelse(heart$KidneyDisease=="Yes",1,0)
heart$SkinCancer<-ifelse(heart$SkinCancer=="Yes",1,0)
heart$Diabetic<-ifelse(heart$Diabetic=="Yes",1,0)
heart$Asthma<-ifelse(heart$Asthma=="Yes",1,0)
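The nine conversions above can also be written as a single loop over the yes/no columns; a sketch on a toy data frame (the same pattern applies to `heart` with the full column list):

```r
# Sketch: convert several Yes/No columns to 0/1 dummies in one pass.
yn_cols <- c("Smoking", "Stroke")  # illustration; extend to all nine columns
toy <- data.frame(Smoking = c("Yes", "No"), Stroke = c("No", "Yes"))
toy[yn_cols] <- lapply(toy[yn_cols], function(x) ifelse(x == "Yes", 1, 0))
toy$Smoking  # -> 1 0
```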

Converting the Race variable to numeric codes

heart$Race<-as.factor(heart$Race)
heart$Ethnicity<-sapply(heart$Race,unclass)

Race is a categorical variable covering six groups. We converted it to numeric codes for the EDA. Because as.factor() orders the levels alphabetically, the codes are: White = 6, Other = 5, Hispanic = 4, Black = 3, Asian = 2, American Indian/Alaskan Native = 1.

freq(heart$Ethnicity)

##   var frequency percentage cumulative_perc
## 1   6    245212      76.68            76.7
## 2   4     27446       8.58            85.3
## 3   3     22939       7.17            92.4
## 4   5     10928       3.42            95.9
## 5   2      8068       2.52            98.4
## 6   1      5202       1.63           100.0

In our dataset most respondents are White (76.7%), followed by Hispanic and then Black; American Indian/Alaskan Native respondents are the fewest.

loadPkg("corrplot")
cor(heart$HeartDisease,heart$GeneralHealth)
## [1] -0.0111

The correlation coefficient is only -0.011, which indicates essentially no linear relationship. This is partly an artifact of the coding: because the levels were numbered alphabetically (1 = Excellent, 2 = Fair, 3 = Good, 4 = Poor, 5 = Very good), GeneralHealth is not an ordinal scale, so a linear correlation with HeartDisease is hard to interpret here.

contable = table(heart$GeneralHealth, heart$HeartDisease)
xkabledply(contable, title="Contingency Table for Heart Disease and GeneralHealth")
Contingency Table for Heart Disease and GeneralHealth
GenHealth 0 1
1 (Excellent) 65342 1500
2 (Fair) 27593 7084
3 (Good) 83571 9558
4 (Poor) 7439 3850
5 (Very good) 108477 5381
addmargins(contable)
##      
##            0      1    Sum
##   1    65342   1500  66842
##   2    27593   7084  34677
##   3    83571   9558  93129
##   4     7439   3850  11289
##   5   108477   5381 113858
##   Sum 292422  27373 319795
prop <-prop.table(contable)
percent <- round(prop*100, 2)

print ("Contingency Table for heart Disease and General Health in Percentage")
## [1] "Contingency Table for heart Disease and General Health in Percentage"
print (percent)
##    
##         0     1
##   1 20.43  0.47
##   2  8.63  2.22
##   3 26.13  2.99
##   4  2.33  1.20
##   5 33.92  1.68
barplot(with(heart, table(GeneralHealth, HeartDisease)), main="heart Disease & GeneralHealth", beside=T, col=3:2)

The contingency table shows the relationship between Heart Disease and General Health as percentages of the full sample. For General Health 1 (Excellent), 20.43% of respondents have no heart disease and 0.47% do; for 2 (Fair), 8.63% versus 2.22%; for 3 (Good), 26.13% versus 2.99%; for 4 (Poor), 2.33% versus 1.20%; and for 5 (Very good), 33.92% versus 1.68%.
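The cell percentages above are shares of the full sample, so they mix in the size of each health category. Conditioning on the row (margin = 1) gives the heart-disease rate within each category, which is easier to compare; a sketch using the counts from the contingency table:

```r
# Row-conditional proportions: % with/without heart disease per category.
contable <- matrix(c(65342, 27593, 83571, 7439, 108477,   # HeartDisease = 0
                     1500, 7084, 9558, 3850, 5381),       # HeartDisease = 1
                   ncol = 2,
                   dimnames = list(GenHealth = 1:5, HeartDisease = 0:1))
round(prop.table(contable, margin = 1) * 100, 1)
# e.g. category 4 (Poor) ~34.1% with heart disease vs ~2.2% for 1 (Excellent)
```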

barplot(table(heart$HeartDisease, heart$GeneralHealth), legend= rownames(table(heart$HeartDisease, heart$GeneralHealth)), col =(c("blue", "red")), xlab="GeneralHealth", ylab="heartdisease",main = "Barplot for Heartdisease and GeneralHealth")

A stacked barplot shows the relationship between GeneralHealth and HeartDisease. In absolute counts, respondents reporting good or fair general health account for the most heart disease cases, although as a proportion of each category, poor and fair health show the highest rates.

barplot(table(heart$HeartDisease, heart$PhysicalActivity), legend= rownames(table(heart$HeartDisease, heart$PhysicalActivity)), col =(c("orange", "blue")), xlab="PhysicalActivity", ylab="heartdisease",main = "Barplot for Heartdisease and PhysicalActivity")

We have far more observations where the person is physically active. The barplot suggests that physical activity carries some signal for whether or not a person will develop heart disease.

Genh_HD=table(heart$GeneralHealth,heart$HeartDisease)
chi_1=chisq.test(Genh_HD)
chi_1
## 
##  Pearson's Chi-squared test
## 
## data:  Genh_HD
## X-squared = 21542, df = 4, p-value <2e-16
HD_Eth=table(heart$PhysicalActivity,heart$HeartDisease)
chi_2=chisq.test(HD_Eth)
chi_2
## 
##  Chi-squared test for given probabilities
## 
## data:  HD_Eth
## X-squared = 2e+05, df = 1, p-value <2e-16

For both chi-squared tests above, the p-value is below 0.05, so we reject the null hypothesis in each case: both General Health and Physical Activity appear related to Heart Disease. (One caveat: the second output reads "Chi-squared test for given probabilities", which R produces for a one-dimensional table, so that result may not be the intended two-way independence test.)

modifieddata<- subset(heart, HeartDisease==1)
modifieddata_1<-head(modifieddata,27373)
nrow(modifieddata_1)
## [1] 27373
modifieddata0<- subset(heart, HeartDisease==0)
modifieddata_01<-head(modifieddata0,27373)
nrow(modifieddata_01)
## [1] 27373
totaldata <- rbind(modifieddata_1, modifieddata_01)
nrow(totaldata)
## [1] 54746
summary(totaldata)
##   HeartDisease      BMI          Smoking  AlcoholDrinking     Stroke 
##  Min.   :0.0   Min.   :12.2   Min.   :0   Min.   :0       Min.   :0  
##  1st Qu.:0.0   1st Qu.:24.3   1st Qu.:0   1st Qu.:0       1st Qu.:0  
##  Median :0.5   Median :27.5   Median :0   Median :0       Median :0  
##  Mean   :0.5   Mean   :28.7   Mean   :0   Mean   :0       Mean   :0  
##  3rd Qu.:1.0   3rd Qu.:31.9   3rd Qu.:0   3rd Qu.:0       3rd Qu.:0  
##  Max.   :1.0   Max.   :83.3   Max.   :0   Max.   :0       Max.   :0  
##  PhysicalHealth  MentalHealth    DiffWalking     Sex           
##  Min.   : 0.0   Min.   : 0.00   Min.   :0    Length:54746      
##  1st Qu.: 0.0   1st Qu.: 0.00   1st Qu.:0    Class :character  
##  Median : 0.0   Median : 0.00   Median :0    Mode  :character  
##  Mean   : 5.5   Mean   : 4.27   Mean   :0                      
##  3rd Qu.: 5.0   3rd Qu.: 4.00   3rd Qu.:0                      
##  Max.   :30.0   Max.   :30.00   Max.   :0                      
##  AgeCategory                                    Race          Diabetic
##  Length:54746       American Indian/Alaskan Native: 1510   Min.   :0  
##  Class :character   Asian                         :  976   1st Qu.:0  
##  Mode  :character   Black                         : 3684   Median :0  
##                     Hispanic                      : 4941   Mean   :0  
##                     Other                         : 1831   3rd Qu.:0  
##                     White                         :41804   Max.   :0  
##  PhysicalActivity     GenHealth       SleepTime         Asthma  KidneyDisease
##  Min.   :0        Excellent: 7576   Min.   : 1.00   Min.   :0   Min.   :0    
##  1st Qu.:0        Fair     : 9884   1st Qu.: 6.00   1st Qu.:0   1st Qu.:0    
##  Median :0        Good     :17318   Median : 7.00   Median :0   Median :0    
##  Mean   :0        Poor     : 4658   Mean   : 7.14   Mean   :0   Mean   :0    
##  3rd Qu.:0        Very good:15310   3rd Qu.: 8.00   3rd Qu.:0   3rd Qu.:0    
##  Max.   :0                          Max.   :24.00   Max.   :0   Max.   :0    
##    SkinCancer GeneralHealth    Ethnicity   
##  Min.   :0    Min.   :1.00   Min.   :1.00  
##  1st Qu.:0    1st Qu.:2.00   1st Qu.:6.00  
##  Median :0    Median :3.00   Median :6.00  
##  Mean   :0    Mean   :3.19   Mean   :5.37  
##  3rd Qu.:0    3rd Qu.:5.00   3rd Qu.:6.00  
##  Max.   :0    Max.   :5.00   Max.   :6.00
set.seed(100)
sample_train_rows <- sample(1:nrow(totaldata), 0.7*nrow(totaldata))
train <- totaldata[sample_train_rows, ]
test <- totaldata[-sample_train_rows, ]

Because the original data is heavily imbalanced (far more respondents without heart disease than with it), we built a balanced subset containing equal numbers of yes (1) and no (0) cases, then split it into 70% training and 30% testing data.
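One caveat: head() takes the first n rows, so if the file has any ordering (by state, age, or interview date) the balanced subset inherits that bias. Randomly downsampling the majority class avoids this; a sketch on a toy outcome vector (the same idea applies to `heart`):

```r
# Sketch: balance classes by randomly sampling the majority class.
set.seed(100)
y <- c(rep(0, 90), rep(1, 10))               # imbalanced toy outcome
idx0 <- which(y == 0)                        # majority class indices
idx1 <- which(y == 1)                        # minority class indices
keep <- c(sample(idx0, length(idx1)), idx1)  # equal numbers of each class
table(y[keep])                               # -> 10 zeros and 10 ones
```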

model <- glm(HeartDisease ~ GenHealth+SleepTime+Race,data=train,family = "binomial")
summary(model)
## 
## Call:
## glm(formula = HeartDisease ~ GenHealth + SleepTime + Race, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.128  -0.981  -0.391   0.997   2.334  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -2.48242    0.09414  -26.37  < 2e-16 ***
## GenHealthFair       2.43667    0.04469   54.52  < 2e-16 ***
## GenHealthGood       1.66261    0.03966   41.92  < 2e-16 ***
## GenHealthPoor       3.06311    0.05908   51.84  < 2e-16 ***
## GenHealthVery good  0.77326    0.04046   19.11  < 2e-16 ***
## SleepTime           0.03152    0.00723    4.36  1.3e-05 ***
## RaceAsian           0.10993    0.11410    0.96   0.3353    
## RaceBlack           0.42937    0.08135    5.28  1.3e-07 ***
## RaceHispanic       -0.23738    0.08061   -2.94   0.0032 ** 
## RaceOther           0.70207    0.09270    7.57  3.6e-14 ***
## RaceWhite           1.00724    0.07078   14.23  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 53125  on 38321  degrees of freedom
## Residual deviance: 45748  on 38311  degrees of freedom
## AIC: 45770
## 
## Number of Fisher Scoring iterations: 4

A model is built on GenHealth, SleepTime, and Race. All coefficients are statistically significant (p < 0.05) except RaceAsian.

xkabledply(model, title = paste("Logistic Regression :", format(formula(model)) ))
Logistic Regression : HeartDisease ~ GenHealth + SleepTime + Race
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.4824 0.0941 -26.370 0.0000
GenHealthFair 2.4367 0.0447 54.519 0.0000
GenHealthGood 1.6626 0.0397 41.923 0.0000
GenHealthPoor 3.0631 0.0591 51.845 0.0000
GenHealthVery good 0.7733 0.0405 19.110 0.0000
SleepTime 0.0315 0.0072 4.360 0.0000
RaceAsian 0.1099 0.1141 0.964 0.3353
RaceBlack 0.4294 0.0813 5.278 0.0000
RaceHispanic -0.2374 0.0806 -2.945 0.0032
RaceOther 0.7021 0.0927 7.573 0.0000
RaceWhite 1.0072 0.0708 14.231 0.0000

All coefficients except RaceAsian are significant (small p-values), and all variables except RaceHispanic have positive effects on HeartDisease.

expcoeff = exp(coef(model))
# expcoeff
xkabledply( as.table(expcoeff), title = "Exponential of coefficients in Logit Reg" )
Exponential of coefficients in Logit Reg
x
(Intercept) 0.0835
GenHealthFair 11.4350
GenHealthGood 5.2731
GenHealthPoor 21.3940
GenHealthVery good 2.1668
SleepTime 1.0320
RaceAsian 1.1162
RaceBlack 1.5363
RaceHispanic 0.7887
RaceOther 2.0179
RaceWhite 2.7380
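To make the exponentiated coefficients concrete: exp(coef) is an odds ratio relative to the baseline level (here, Excellent general health). GenHealthPoor's odds ratio of about 21.4 means the odds of heart disease for "Poor" health are roughly 21 times the baseline odds, holding SleepTime and Race fixed. A log-odds value can also be converted to a probability; a worked sketch using the fitted coefficients, with SleepTime and Race held at their reference/zero values (unrealistic, but it keeps the arithmetic simple):

```r
# Worked example: predicted probability for a "Poor" general-health
# respondent at the baseline race level and SleepTime = 0.
log_odds <- -2.4824 + 3.0631   # intercept + GenHealthPoor coefficient
odds <- exp(log_odds)          # odds = exp(log-odds)
odds / (1 + odds)              # probability = odds / (1 + odds), ~0.64
```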

Confusion matrix

#loadPkg("regclass")
# confusion_matrix
xkabledply( confusion_matrix(model), title = "Confusion matrix from Logit Model" )
Confusion matrix from Logit Model
Predicted 0 Predicted 1 Total
Actual 0 13326 5936 19262
Actual 1 5993 13067 19060
Total 19319 19003 38322
#unloadPkg("regclass")

Accuracy determines the overall predictive accuracy of the model. It is calculated as Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives).

From the confusion matrix above, accuracy = (13326 + 13067) / 38322 ≈ 68.9%.
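The accuracy formula can be checked directly against the confusion matrix above:

```r
# Counts copied from the confusion matrix (Actual x Predicted).
cm <- matrix(c(13326, 5993,    # Predicted 0 column (Actual 0, Actual 1)
               5936, 13067),   # Predicted 1 column (Actual 0, Actual 1)
             nrow = 2,
             dimnames = list(Actual = 0:1, Predicted = 0:1))
accuracy <- sum(diag(cm)) / sum(cm)  # (TP + TN) / total
round(accuracy, 3)                   # -> 0.689
```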

loadPkg("pROC")
library(pROC)
library(ROCR)
test_prob <- predict(model, newdata = test, type = "response")
train_prob <-predict(model, newdata = train, type = "response")
test_roc <- roc(test$HeartDisease ~ test_prob, plot = TRUE, print.auc = TRUE)

The area under the curve on the test set is 73.9%.

fittedheartTrain = ifelse(train_prob < 0.5, 0, 1)
ConfMatTrain = table(train$HeartDisease, fittedheartTrain)
TrainAccuracy = (ConfMatTrain[1,1] + ConfMatTrain[2,2]) / sum(ConfMatTrain)
# TPR and FPR computed from the confusion matrix rather than hard-coded counts
TrainTPR = ConfMatTrain[2,2] / (ConfMatTrain[2,1] + ConfMatTrain[2,2])
TrainFPR = ConfMatTrain[1,2] / (ConfMatTrain[1,1] + ConfMatTrain[1,2])

print(cbind(TrainAccuracy, TrainTPR, TrainFPR))
##      TrainAccuracy TrainTPR TrainFPR
## [1,]         0.689    0.867    0.112
fittedheartTest = ifelse(test_prob < 0.5, 0, 1)
ConfMatTest = table(test$HeartDisease, fittedheartTest)
TestAccuracy = (ConfMatTest[1,1] + ConfMatTest[2,2]) / sum(ConfMatTest)
TestTPR = ConfMatTest[2,2] / (ConfMatTest[2,1] + ConfMatTest[2,2])
TestFPR = ConfMatTest[1,2] / (ConfMatTest[1,1] + ConfMatTest[1,2])  # 1 - true negative rate

print(cbind(TestAccuracy, TestTPR, TestFPR))
##      TestAccuracy TestTPR TestFPR
## [1,]        0.684   0.818   0.119

Limitations of Dataset:

Our dataset contains 18 variables; for the EDA we used only a subset of them to draw preliminary conclusions, and we may use all of the variables when building the final model.

Telephone interviews and voluntary-response bias: because of COVID-19, most of the surveys were conducted by telephone, since many people did not want to meet in person. The surveys asked about key indicators such as high blood pressure, high cholesterol, and smoking across racial and ethnic groups to understand their role in heart disease. This format tends to over-represent people who are home and willing to answer (for example, older or retired respondents), and response quality varied widely: some respondents, frustrated by lockdowns, were rude or uncooperative, while others answered the questions carefully.

Untruthful responses: we are uncertain about the reliability of some answers, since respondents sometimes gave different answers to the same questions; for instance, responses changed when respondents were told they might receive medical discounts.

The survey period also coincided with the peak of COVID-19 in 2021, when many people were infected. Post-COVID, weakened immune systems made people more prone to illnesses such as diabetes and asthma, and COVID-era deaths may be confounded with heart-disease deaths, since many of those infected also carried key indicators such as high BMI and high cholesterol. Both older and younger people were affected, with high death rates in both groups, and respondents tended to report more secondary illnesses, which are in turn associated with the primary disease of interest, heart disease.

Analysis of SMART Questions:

From the first question, "Which gender tends to have more heart disease based on BMI, smoking, drinking, and mental health in the year 2020?", we concluded that, on average, people with heart disease have a slightly higher BMI than those without it. The biggest difference, however, is in the number of days a respondent feels physically unwell: respondents with heart disease reported feeling unwell about 8 days per month, versus only 3 for healthy respondents. Poor mental health days are also slightly higher for people with heart disease. We then compared lifestyle variables between men and women, and further between those with and without heart disease, to ask why men have more heart disease in our sample: are they doing more or less of a certain activity than their female counterparts? A t-test on the full sample lets us reject the null hypothesis and conclude that overall average BMI differs between men and women: men have a mean BMI of 28.50 versus 28.16 for women, a difference of 0.34. Restricting to respondents who have heart disease, the t-test p-value is 0.604, which is greater than 0.05, so we fail to reject the null and conclude there is no significant difference in average BMI between men and women with heart disease. For days feeling unwell, the p-value is below 5%, so there is a difference: on average, women reported feeling sick about 9 days, while men reported just over 6.

From the second SMART question, "Are people with heart disease more prone to developing secondary illnesses, or is it the other way around?", we concluded that secondary illnesses are significantly associated with heart disease. Based on chi-squared tests with p-values below 0.05, we reject the null hypothesis: every secondary illness (diabetes, asthma, kidney disease, and skin cancer) has a relationship with heart disease.

From the third SMART question, "Are smokers or drinkers more prone to developing heart disease?", we based our analysis on chi-squared tests rather than assumptions. For both smoking and drinking, the tests give p-values below 0.05, so we reject the null hypothesis: both drinking and smoking have a relationship with heart disease.

From all the SMART questions we conclude that smokers, drinkers, and people with secondary illnesses are more prone to developing heart disease and face higher death rates. People with high BMI are also less healthy and tend to get sick more often.

Next Steps: The next step is to use what we learned from our dataset to develop a predictive model that is helpful for everyone. As the saying goes, health is wealth; we aim to build a model that identifies which variables contribute to heart disease, so that their effects can be reduced.

Conclusions:

To conclude, our EDA found strong evidence that higher BMI, high cholesterol, and older age are associated with heart disease. The relevant variables were Sex, BMI, Physical Health, Mental Health, Sleep Time, secondary illnesses, smoking, and alcohol drinking.

References

Ara S. A literature review of cardiovascular disease management programs in managed care populations. Journal of Managed Care Pharmacy 2004; 10(4): 326-344.

Centers for Disease Control and Prevention. Underlying Cause of Death, 1999–2018. CDC WONDER Online Database. Atlanta, GA: Centers for Disease Control and Prevention; 2018.

CDC. (2019, December 2). Heart Disease Facts | cdc.gov. Centers for Disease Control and Prevention.

Giampaoli S, Palmieri L, Mattiello A, et al. Definition of high risk individuals to optimize strategies for primary prevention of cardiovascular diseases. Nutr Metab Cardiovasc Dis 2005;15:79–85.

Here’s What an Extra 20 Minutes of Sleep Does to Your Brain. (2020, July 17). Well+Good. https://www.wellandgood.com/benefits-of-getting-more-sleep/

Hong, Y. (2018). Framingham Heart Study (FHS) [Review of Framingham Heart Study (FHS)]. National Heart, Lung and Blood Institute. https://www.nhlbi.nih.gov/science/framingham-heart-study-fhs

Liau, S. Y., Mohamed Izham, M. I., Hassali, M. A., & Shafie, A. A. (2010). A literature review of the cardiovascular risk-assessment tools: applicability among Asian population. Heart Asia, 2(1), 15–18. https://doi.org/10.1136/ha.2009.001115

Robbins, R., Quan, S. F., Weaver, M. D., Bormes, G., Barger, L. K., & Czeisler, C. A. (2021). Examining sleep deficiency and disturbance and their risk for incident dementia and all-cause mortality in older adults across 5 years in the United States. Aging, 13(3), 3254–3268. https://doi.org/10.18632/aging.202591

Sadick, B. (2019, April). Women Die From Heart Attacks More Often Than Men. Here’s Why — and What Doctors Are Doing About It. Time; Time. https://time.com/5499872/women-heart-disease/

Centers for Disease Control and Prevention. Conditions that increase risk for stroke. https://www.cdc.gov/stroke/conditions.htm

Soetewey, A. Chi-square test of independence in R. Stats and R. https://statsandr.com/blog/chi-square-test-of-independence-in-r/

Statistics Solutions. Chi-Square. https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/chi-square/